Star Wars: The Data Awakens

Welcome to Cornell College!

Tyler George

Introductions

  • Professor Tyler George

  • Student Robin Gillette

Famous Quotes

  • Who has a favorite Star Wars quote?

  • Would you say that quote is positive or negative?

  • Famous quotes, according to The Hollywood Reporter

    • “Luke, I am your father”
    • Just kidding, that is not actually the real quote! The line in the film is “No, I am your father.”

Text Analysis

  • A popular technique in Data Science is text analysis.

  • What do you think text analysis is (beyond analysis of text)?

Sentiment Analysis

  • Today you will all be trying your hand at a type of text analysis called sentiment analysis.

  • Any educated guesses on what sentiment analysis is?

Activity

  • At each of your tables, you have:
    • Star Wars movie quotes
    • A sentiment lexicon: a dictionary that labels words by their sentiment.
      • This one is called Bing
      • Every word in it is labeled either positive or negative

Activity Instructions

  • Fill in the table with each word of your quotes on the left and its sentiment (positive or negative) on the right.

  • Count the number of positives and negatives and write your totals at the bottom.

  • Scan the QR code and enter your movie (IV, V, or VI). Then, you’ll see the number of positive sentiment words and negative sentiment words in your quotes.

    • If extra sheets are lying around, feel free to work on those quotes, too!

    • You can also respond at bit.ly/SWTDASent.
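
The hand tally above can also be sketched in R. This is a minimal illustration, not the code used in the session: the three-word `lexicon` is a made-up stand-in for Bing (its labels are assumptions and may not match the real lexicon), and the quote is just one well-known line from Episode IV.

```r
# Hypothetical mini-lexicon standing in for Bing; labels are assumed for illustration.
lexicon <- c(lack = "negative", faith = "positive", disturbing = "negative")

quote <- "I find your lack of faith disturbing"

# Split the quote into lowercase words.
words <- strsplit(tolower(quote), "[^a-z']+")[[1]]
words <- words[words != ""]

# Look each word up; words missing from the lexicon come back as NA.
sentiment <- unname(lexicon[words])

# Count positives and negatives (NA entries are left out of the tally).
tally <- table(factor(sentiment, levels = c("positive", "negative")))
print(tally)
```

Words that are not in the lexicon simply do not count toward either total, which mirrors leaving their row blank on the worksheet.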

Let’s Dive Deeper

  • What would we need to do if we wanted to understand the sentiment of a character or a movie as a whole?

  • Each table has one computer connected to the TV at that table.

  • Each computer has a program called RStudio on the screen, which allows you to program in the language R.

  • R is a programming language for statistical analysis and is one of the two most common languages in the field of data science (Python is the other).

What you will need…

A row of a dataset runs left to right. A column of a dataset is vertical and runs top to bottom (think of the lettered columns in a Google Sheet).

  • filter: This function keeps only the rows with words spoken by “Vader”

  • count: This function counts how many times each word appears

  • group_by and slice_max: These functions take the counts and keep the 5 most common words within each sentiment group: the top 5 positive words and the top 5 negative words.
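
Chained together, those functions might look like the sketch below. It assumes the dplyr package is installed and that the data has already been split into one word per row with a sentiment label attached. The data frame `quote_words`, its column names, and its contents are all hypothetical and made up for illustration; they are not the dataset on the table computers.

```r
library(dplyr)

# Hypothetical one-word-per-row data; names and sentiment labels are assumptions.
quote_words <- data.frame(
  character = c("Vader", "Vader", "Vader", "Vader", "Vader", "Vader", "Luke", "Luke"),
  word      = c("power", "power", "destroy", "fail", "destroy", "destroy", "hope", "fail"),
  sentiment = c("positive", "positive", "negative", "negative",
                "negative", "negative", "positive", "negative"),
  stringsAsFactors = FALSE
)

top_words <- quote_words %>%
  filter(character == "Vader") %>%      # keep only rows spoken by Vader
  count(word, sentiment) %>%            # one row per word, with its count in column n
  group_by(sentiment) %>%               # handle positive and negative words separately
  slice_max(order_by = n, n = 5) %>%    # keep the (up to) 5 most frequent words per group
  ungroup()

print(top_words)
```

By default `slice_max` keeps ties, so a group can return more than 5 words when several share the same count.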

A More Advanced Analysis

  • Dennis Bakhuis scraped all of Wookieepedia’s information on Star Wars

  • He posted all of his work HERE.

  • One amazing result is a network plot.

Acknowledgements

  • Star Wars is owned by Lucasfilm. I do not own any of the rights to this information.

  • Tidy Text by Julia Silge and David Robinson

  • Kaggle Report by Xavier Vivancos García